Question and data background:

Question

Employee turnover is costly. An inability to retain talent forces a company to frequently retrain employees, and a general lack of continuity creates a variety of business challenges. This led us to ask: can we predict when an employee is likely to leave? If so, we could use a targeted raise program to keep attrition rates low.

Dataset

Included below is a summary of the dataset prior to processing. It includes how long an employee has spent at their current job and with their current manager, how far their commute is, and a variety of other variables that describe their current work environment.

## 'data.frame':    1470 obs. of  35 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : chr  "Yes" "No" "Yes" "No" ...
##  $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : chr  "Sales" "Research & Development" "Research & Development" "Research & Development" ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : chr  "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : chr  "Female" "Male" "Male" "Female" ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : chr  "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : chr  "Single" "Married" "Single" "Married" ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
##  $ OverTime                : chr  "Yes" "No" "Yes" "Yes" ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

Exploratory Data Analysis

Correlogram

Clustering Attempt

After checking our correlogram, we decided to cluster on all 15 numeric variables; however, the clustering was quite unsuccessful. The within-cluster sum of squares (withinss) measure is displayed below.

## [1] 0.1225601
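The clustering step can be sketched in a few lines of base R. The data frame below is a synthetic stand-in (the real numeric HR columns are not reproduced here); the relevant parts are scaling before k-means, since the variables sit on very different scales, and the between/total sum-of-squares diagnostic.

```r
set.seed(42)
# Synthetic stand-in for a few of the numeric HR columns
df <- data.frame(age    = rnorm(100, 35, 8),
                 income = rnorm(100, 5000, 2000),
                 years  = rpois(100, 5))
scaled <- scale(df)                       # k-means is scale-sensitive
km <- kmeans(scaled, centers = 2, nstart = 25)
ratio <- km$betweenss / km$totss          # low ratio = weak cluster separation
ratio
```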

Histogram of Numeric Columns

Methods

Rpart2 Decision Tree

Data partition

Simple Classifier

This model provided an accuracy of 0.8597. However, the sensitivity (TPR) was only 0.38 while the specificity was about 0.96, which raised concerns that the model was overlearning the majority class. This prompted us to balance the dataset.
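A minimal sketch of such a baseline tree, using rpart directly rather than caret's rpart2 wrapper (rpart2 tunes the maxdepth of an rpart tree). The synthetic columns and class balance below are illustrative assumptions, not the real data.

```r
library(rpart)
set.seed(1)
# Synthetic stand-in for the training split, at roughly the real 16% attrition rate
n <- 500
train <- data.frame(
  MonthlyIncome = rexp(n, 1 / 5000),
  OverTime      = factor(sample(c("Yes", "No"), n, replace = TRUE)),
  Attrition     = factor(sample(c("Yes", "No"), n, replace = TRUE,
                                prob = c(0.16, 0.84)))
)
# Fix maxdepth directly; caret's rpart2 would search over this value
fit  <- rpart(Attrition ~ ., data = train,
              control = rpart.control(maxdepth = 5))
pred <- predict(fit, train, type = "class")
mean(pred == train$Attrition)   # training accuracy
```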

Balanced Dataset

## 
##  No Yes 
## 533 497

In an effort to mitigate the risk of overlearning the majority class, we trained another rpart2 model on a dataset where we oversampled the minority class until the classes were roughly balanced. The prevalence of the former minority class was now about 48% (497 of 1,030). Using this model to predict the test set, our sensitivity and specificity both converged around 0.72, but overall accuracy fell to 0.7466, so we set this model aside.
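The oversampling step can be sketched in base R (caret's upSample does essentially the same thing). The labels below are a toy stand-in at the original 16% prevalence; unlike the 533/497 split above, this version balances the classes exactly.

```r
set.seed(7)
# Toy imbalanced outcome standing in for Attrition (84% "No", 16% "Yes")
y <- factor(c(rep("No", 840), rep("Yes", 160)))
idx_min <- which(y == "Yes")
# Resample the minority class with replacement up to the majority count
extra <- sample(idx_min, sum(y == "No") - length(idx_min), replace = TRUE)
y_bal <- y[c(seq_along(y), extra)]
table(y_bal)
```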

Restricted feature space

used_vars <- c("MonthlyIncome", "TotalWorkingYears", "YearsWithCurrManager", "JobRole", "YearsAtCompany", "Age", "OverTime", "EnvironmentSatisfaction", "JobLevel", "NumCompaniesWorked")

Accuracy dropped to 0.8371, and ROC AUC was 0.69 (no improvement over the full feature set). The sensitivity and specificity concerns persisted.

Expanded maxdepth

We selected maxdepth = 16. Even after adjusting the maxdepth hyperparameter, accuracy, sensitivity, and specificity stayed roughly the same, and the sensitivity and specificity concerns persisted.
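The effect of the maxdepth limit can be checked directly with rpart.control. On synthetic data like the toy example below, raising the cap often leaves the fitted tree unchanged once it is deep enough, which mirrors the behavior we saw where a larger limit did not move the metrics.

```r
library(rpart)
set.seed(2)
# Toy two-feature classification problem
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- factor(ifelse(d$x1 + rnorm(300, sd = 0.5) > 0, "Yes", "No"))
# Fit the same tree under increasing depth caps and compare tree sizes
depths <- c(2, 8, 16)
sizes <- sapply(depths, function(md) {
  fit <- rpart(y ~ ., data = d, control = rpart.control(maxdepth = md))
  nrow(fit$frame)   # number of nodes in the fitted tree
})
sizes
```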

Variable Importance

## rpart2 variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##                          Overall
## MonthlyIncome             100.00
## JobRole                    98.08
## OverTime                   94.92
## TotalWorkingYears          76.04
## YearsWithCurrManager       69.72
## StockOptionLevel           50.87
## NumCompaniesWorked         49.75
## YearsAtCompany             46.71
## EmployeeNumber             42.49
## EnvironmentSatisfaction    41.02
## WorkLifeBalance            39.52
## DailyRate                  37.98
## JobLevel                   36.52
## Department                 30.49
## DistanceFromHome           24.08
## HourlyRate                 19.96
## EducationField             19.13
## MonthlyRate                19.05
## RelationshipSatisfaction   18.70
## YearsSinceLastPromotion    15.35
## rpart2 variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##                          Overall
## StockOptionLevel         100.000
## MonthlyIncome             74.743
## YearsWithCurrManager      51.580
## TotalWorkingYears         46.193
## JobLevel                  41.304
## EnvironmentSatisfaction   39.201
## OverTime                  37.398
## DistanceFromHome          34.579
## RelationshipSatisfaction  30.854
## YearsAtCompany            24.743
## JobRole                   23.894
## YearsSinceLastPromotion   21.723
## MonthlyRate               20.630
## EmployeeNumber            18.236
## Age                       17.928
## TrainingTimesLastYear     16.318
## YearsInCurrentRole        15.690
## DailyRate                 15.567
## WorkLifeBalance           12.825
## EducationField             9.875
## rpart2 variable importance
## 
##                         Overall
## MonthlyIncome           100.000
## YearsWithCurrManager     90.451
## OverTime                 77.037
## JobRole                  73.494
## TotalWorkingYears        69.563
## YearsAtCompany           47.905
## EnvironmentSatisfaction  26.555
## JobLevel                 19.969
## NumCompaniesWorked        4.954
## Age                       0.000
## rpart2 variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##                          Overall
## MonthlyIncome             100.00
## JobRole                    98.08
## OverTime                   94.92
## TotalWorkingYears          76.04
## YearsWithCurrManager       69.72
## StockOptionLevel           50.87
## NumCompaniesWorked         49.75
## YearsAtCompany             46.71
## EmployeeNumber             42.49
## EnvironmentSatisfaction    41.02
## WorkLifeBalance            39.52
## DailyRate                  37.98
## JobLevel                   36.52
## Department                 30.49
## DistanceFromHome           24.08
## HourlyRate                 19.96
## EducationField             19.13
## MonthlyRate                19.05
## RelationshipSatisfaction   18.70
## YearsSinceLastPromotion    15.35

Random Forest

Our original Random Forest was so bad that we did not even keep the code: it predicted that only 4 people would ever quit. We changed every parameter we could think of, but the simple truth was that we did not have enough data, our data was too imbalanced, and we had too many features. After rebalancing the data and trimming the feature space to test different rpart2 models, we decided to see how this would change our Random Forest results.

Original Parameters

We used the mtry tuning function to arrive at an mtry of about 6. Additionally, we started with a sample size of 100 and 1000 trees. We saw increased success with this model, compared to the first RF at least, but wanted to continue tinkering with the parameters.

Changing Parameters

We increased the sample size from 100 to 200, and OOB error dropped as we expected. Additionally, we reduced the number of trees from 1000 to 500. The high number of features and the relatively limited number of rows in this dataset meant that our random forest model was facing an overlearning problem. By restricting the number of features, and thereby improving the ratio of observations to features, we hoped to mitigate it.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  No Yes
##        No  150  13
##        Yes  35  23
##                                           
##                Accuracy : 0.7828          
##                  95% CI : (0.7226, 0.8353)
##     No Information Rate : 0.8371          
##     P-Value [Acc > NIR] : 0.986238        
##                                           
##                   Kappa : 0.3609          
##                                           
##  Mcnemar's Test P-Value : 0.002437        
##                                           
##             Sensitivity : 0.6389          
##             Specificity : 0.8108          
##          Pos Pred Value : 0.3966          
##          Neg Pred Value : 0.9202          
##               Precision : 0.3966          
##                  Recall : 0.6389          
##                      F1 : 0.4894          
##              Prevalence : 0.1629          
##          Detection Rate : 0.1041          
##    Detection Prevalence : 0.2624          
##       Balanced Accuracy : 0.7248          
##                                           
##        'Positive' Class : Yes             
## 

This RF model produced an accuracy of 0.78, sensitivity of 0.64, and specificity of 0.81. Although the balanced dataset improved the overlearning problem we had, accuracy was still below the no-information rate of 0.8371 (the accuracy of always predicting "No").
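A sketch of the tuned forest, assuming the randomForest package. The data is synthetic; ntree = 500 and sampsize = 200 mirror the parameters described above, while mtry is 2 here simply because the toy data has only three predictors (in the real analysis it came from the mtry tuning step).

```r
library(randomForest)
set.seed(3)
n <- 600
# Synthetic stand-in for the balanced training data
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- factor(ifelse(d$x1 - d$x2 + rnorm(n, sd = 0.7) > 0, "Yes", "No"))
# ntree and sampsize mirror the report; mtry would normally come from tuning
rf <- randomForest(y ~ ., data = d, ntree = 500, sampsize = 200, mtry = 2)
rf$err.rate[500, "OOB"]   # final out-of-bag error rate
```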

KNN

After having only marginal success with decision trees, where we could not push accuracy significantly above the no-information rate, we decided to move to a KNN approach. We figured that even with our quite imbalanced dataset, KNN would give us a marginally better model.

Selecting the correct k
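
A sketch of the k sweep, assuming the class package's knn and synthetic data: features are scaled first (KNN is distance-based), then holdout accuracy is computed for each candidate k.

```r
library(class)
set.seed(4)
n <- 400
# Synthetic scaled features and a label driven by the first feature
X <- scale(matrix(rnorm(n * 3), ncol = 3))
y <- factor(ifelse(X[, 1] + rnorm(n, sd = 0.5) > 0, "1", "0"))
tr <- sample(n, 300)                      # train/holdout split
accs <- sapply(1:15, function(k) {
  pred <- knn(X[tr, ], X[-tr, ], y[tr], k = k)
  mean(pred == y[-tr])
})
which.max(accs)   # candidate k with the best holdout accuracy
```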

Model Output

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   0   1
##          0 192  14
##          1   2  13
##                                           
##                Accuracy : 0.9276          
##                  95% CI : (0.8851, 0.9581)
##     No Information Rate : 0.8778          
##     P-Value [Acc > NIR] : 0.01146         
##                                           
##                   Kappa : 0.5826          
##                                           
##  Mcnemar's Test P-Value : 0.00596         
##                                           
##             Sensitivity : 0.48148         
##             Specificity : 0.98969         
##          Pos Pred Value : 0.86667         
##          Neg Pred Value : 0.93204         
##              Prevalence : 0.12217         
##          Detection Rate : 0.05882         
##    Detection Prevalence : 0.06787         
##       Balanced Accuracy : 0.73559         
##                                           
##        'Positive' Class : 1               
## 

Evaluation Metrics

After finalizing our KNN model, we evaluated it on the following additional metrics:

LogLoss:

Shown above is our LogLoss followed by the baseline. Our LogLoss of 0.89 is significantly lower than the baseline LogLoss of 1.8. This is encouraging because it means our model is rarely both confident and wrong in its classifications. Given such an imbalanced dataset, we were quite satisfied with this metric.
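For reference, LogLoss can be computed in a few lines of base R. The labels and probabilities below are toy values, and the baseline is a constant prediction at the observed prevalence; the clipping step avoids taking log(0) for extreme probabilities.

```r
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)        # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
y <- c(1, 0, 0, 1, 0)                     # toy labels
p <- c(0.8, 0.2, 0.1, 0.6, 0.3)           # toy model probabilities
baseline <- rep(mean(y), length(y))       # constant-prevalence baseline
c(model = log_loss(y, p), baseline = log_loss(y, baseline))
```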

F1 Score

Our F1 score of 0.62 is much higher than the baseline F1 we calculated of 0.2776. This is additionally encouraging given the imbalance of our dataset.
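The F1 computation itself is simple; this hypothetical helper shows the precision/recall arithmetic on toy vectors.

```r
f1_score <- function(actual, predicted, positive = "Yes") {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)   # harmonic mean
}
actual    <- c("Yes", "Yes", "No", "No", "No", "Yes")
predicted <- c("Yes", "No",  "No", "Yes", "No", "Yes")
f1_score(actual, predicted)
```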

Conclusion and Future Work

Conclusion

In the end, we built a model that is quite successful at predicting attrition. Given the business value of predicting attrition, we wanted our model to catch every person who was going to leave, even if this led to some false positives. This matters because retraining new employees is far more expensive than targeted salary raises that convince current employees to stay. Our dataset was challenging: it contained relatively few rows, and the data it did contain was quite imbalanced. Because of this imbalance, we really struggled to beat the no-information rate with decision trees. Ensemble methods were messy and performed worse than our individual trees, even after balancing the dataset and reducing the feature space. Ultimately, we had the most success with a KNN model at a relatively low k. This model is advantageous because it is uncomplicated and predicts significantly better than the no-information rate and any other model we built. It will save the company money: they will be able to predict and incentivize employees likely to leave, avoiding retraining costs and high turnover.

Future Work

Future work that would benefit this model includes gathering more data. Our small dataset made many advanced methods challenging to use, and with a dataset this imbalanced, more data would be especially beneficial. Additionally, with more company-specific data, the model could be tuned for profitability, weighing the cost of raises against the cost of letting employees leave. Models that failed to meet certain metrics could be further tailored to a company's needs.